1 Climate change and temperature anomalies

To study climate change, we can use data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere from NASA’s Goddard Institute for Space Studies. We read the tabular data of temperature anomalies directly from the CSV file in the code below.

We load NASA temperature deviation data, for which 1951-1980 is the reference (base) period. We pass skip and na as arguments to select the area where the data starts (i.e., after the first row) and to specify how missing values are coded, respectively.

weather <- 
  read_csv("https://data.giss.nasa.gov/gistemp/tabledata_v4/NH.Ts+dSST.csv", 
           skip = 1, # skip the title row
           na = "***") # specify how missing values are coded

We select only the columns that contain values for the 12 months from our original data set. We then reshape the data so that there is one value per month per year, i.e., we pivot longer. The result is a table with three columns: the year, the month, and the temperature deviation, which we call delta.

tidyweather <- weather %>%
  select(Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec) %>% #select year and the 12 months variables
  pivot_longer(cols = 2:13, 
               names_to = "month", # column names go into a 'month' variable
               values_to = "delta") # cell values go into a 'delta' variable

glimpse(tidyweather)
## Rows: 1,704
## Columns: 3
## $ Year  <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880…
## $ month <chr> "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "…
## $ delta <dbl> -0.34, -0.50, -0.22, -0.29, -0.05, -0.15, -0.17, -0.25, -0.22, -…

1.1 Plotting Information

Let us plot the data as a time-series scatter plot and add a trendline. To do that, we first create a new variable called date to ensure that the delta values are plotted chronologically.

tidyweather <- tidyweather %>%
  mutate(date = ymd(paste(as.character(Year), month, "1")), # build a date from the separate year and month columns
         month = month(date, label=TRUE),
         year = year(date))

ggplot(tidyweather, aes(x=date, y = delta))+
  geom_point() +
  geom_smooth(method = 'lm',
              se = FALSE,
              color="red") + # we add the trendline to the scatter plot, coloring it red
  theme_bw() +
  labs(
    title = "Weather Anomalies",
    x = "Year",
    y = "Temperature deviation"
    ) +
  NULL

In the graph above we have aggregated the whole dataset. Let’s look at whether the temperature deviation behaves differently in each month. To do this we use facet_wrap to plot the temperature deviation for each month, giving 12 graphs in total.

ggplot(tidyweather, aes(x = date, y = delta))+
  geom_point(colour = "dark grey")+
  facet_wrap(~month) + # we create 12 different graphs, one for each month
  geom_smooth(method = 'lm' ,
              se = FALSE,
              color="magenta") +
  theme_bw() +
  labs(
    title = "Weather Anomalies",
    x = "Year",
    y = "Temperature deviation"
    ) +
  NULL

It is sometimes useful to group historical data into different time periods. For example, we often refer to decades such as the 1970s, 1980s, and 1990s. NASA calculates a temperature anomaly as the difference from the base period of 1951-1980. The code below creates a new data frame called comparison that groups the data into five time periods: 1881-1920, 1921-1950, 1951-1980, 1981-2010 and 2011-present.

We remove data prior to 1881 using filter. Then, we use the mutate function to create a new variable, interval, which records the period each observation belongs to. We assign the different periods using case_when().

comparison <- tidyweather %>% 
  filter(Year >= 1881) %>%     #remove years prior to 1881
  #create new variable 'interval', and assign values based on criteria below:
  mutate(interval = case_when(
    Year %in% c(1881:1920) ~ "1881-1920",
    Year %in% c(1921:1950) ~ "1921-1950",
    Year %in% c(1951:1980) ~ "1951-1980",
    Year %in% c(1981:2010) ~ "1981-2010",
    TRUE ~ "2011-present"
    ))

head(comparison)
## # A tibble: 6 × 6
##    Year month delta date        year interval 
##   <dbl> <ord> <dbl> <date>     <dbl> <chr>    
## 1  1881 Jan   -0.3  1881-01-01  1881 1881-1920
## 2  1881 Feb   -0.21 1881-02-01  1881 1881-1920
## 3  1881 Mar   -0.03 1881-03-01  1881 1881-1920
## 4  1881 Apr    0.01 1881-04-01  1881 1881-1920
## 5  1881 May    0.04 1881-05-01  1881 1881-1920
## 6  1881 Jun   -0.32 1881-06-01  1881 1881-1920

Now that we have the interval variable, we create a density plot displaying the distribution of monthly temperature deviations. We apply our previously created groupings to obtain a more condensed graph. Finally, we use a different color for each grouping via fill.

ggplot(comparison, aes(x = delta, fill = interval)) + #fill according to different time intervals
  geom_density(alpha = 0.2) +   #density plot with transparency set to 20%
  theme_bw() +                #theme
  labs(
    title = "Density Plot for Monthly Temperature Anomalies",
    y = "Density",         #changing y-axis label to sentence case
    x = "Temperature deviation",
    fill = "Time interval" #rename legend
    ) +
  NULL

We can group by Year in order to have a look at annual anomalies. We then use summarise() to create the average annual deviation and plot it in a scatter plot.

#creating yearly averages
average_annual_anomaly <- tidyweather %>% 
  group_by(Year) %>%   #grouping data by Year
  summarise(annual_average_delta = mean(delta, na.rm = TRUE)) # mean delta per year, ignoring missing values

#plotting the data:
ggplot(average_annual_anomaly, aes(x = Year, y= annual_average_delta))+
  geom_point()+
  
  #Fit the best fit line, using LOESS method
  geom_smooth(method = "loess") +
  theme_bw() +
  labs(
    title = "Average Yearly Anomaly",
    y = "Average Annual Delta"
  ) +
  NULL

1.2 Confidence Interval for delta

As a next step, we construct a confidence interval for the average annual delta since 2011, both using a formula and using a bootstrap simulation with the infer package. Recall that the dataframe comparison has already grouped temperature anomalies according to time intervals; we are only interested in what is happening between 2011-present. For the formula method, we filter for data since 2011 and we calculate summary statistics for delta using summarise(). We use t-critical for the bottom 2.5% and top 2.5%, respectively.

formula_ci <- comparison %>% 
  filter(interval == "2011-present") %>%  # choose the interval 2011-present
  # calculate summary statistics for temperature deviation (delta), including mean, SD, count, SE, lower/upper 95% CI 
  summarise(
    #min_delta = min(delta, na.rm = TRUE),
    #max_delta = max(delta, na.rm = TRUE),
    #median_delta = median(delta, na.rm = TRUE),
    mean_delta = mean(delta, na.rm = TRUE),
    sd_delta = sd(delta, na.rm = TRUE),
    count = n(),
    se_delta = sd_delta/sqrt(count),
    t_critical = qt(0.975, count - 1 ),
    lower = mean_delta - t_critical * se_delta,
    upper = mean_delta + t_critical * se_delta,
  )

#print out formula_CI
formula_ci
## # A tibble: 1 × 7
##   mean_delta sd_delta count se_delta t_critical lower upper
##        <dbl>    <dbl> <int>    <dbl>      <dbl> <dbl> <dbl>
## 1       1.06    0.274   132   0.0239       1.98  1.01  1.11

Let’s now use the infer package to recreate our findings above. Both lower and upper boundaries turn out to be very similar, which is also shown in the plot we create below.

# use the infer package to construct a 95% CI for delta
library(infer)
set.seed(1234)

boot_delta <- comparison %>%
  filter(interval == "2011-present") %>%
  specify(response = delta) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") # we use 1000 iterations to calculate means

percentile_ci <- boot_delta %>% 
  get_confidence_interval(level = 0.95, type = "percentile") # the 95% percentile confidence interval

percentile_ci
## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     1.02     1.11
visualize(boot_delta) + 
  geom_vline(xintercept = formula_ci$lower,
             colour = "blue",
             linetype = "dashed",
             size=1.2) +
  geom_vline(xintercept = formula_ci$upper,
             colour = "blue",
             linetype = "dashed",
             size=1.2)+
  shade_ci(endpoints = percentile_ci,
           fill = "light grey") +
  scale_y_continuous(limits = c(0, 250),
                     expand = c(0,0)) +
  theme_bw()+
  labs(
    title = 'Bootstrap CI for Delta',
    subtitle = 'Formula CI shown with dashed blue lines',
    x = "Delta",
    y = "Count"
  ) +
  NULL

Explanation of what we have done and what the results are showing us

After filtering for observations from 2011 onwards, we use the specify() function from the infer package to set the variable we are interested in, which in our case is delta, the deviation of the actual temperature.

With generate(), we create 1,000 bootstrap samples by setting type = "bootstrap": each sample is drawn with replacement and has the same size as the input data. As we have specified stat = "mean" in calculate(), we then compute the mean of delta for each of the 1,000 samples; the results are stored in the stat column of the boot_delta dataframe. With the get_confidence_interval() function we calculate the 95% confidence interval from that distribution.
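The resampling that generate() and calculate() perform can be mimicked in a few lines of base R. This is a minimal sketch on a made-up toy vector (not the NASA data), purely to show the mechanics:

```r
set.seed(1234)

# toy stand-in for the 2011-present monthly deltas (made-up numbers)
delta <- c(1.1, 0.9, 1.2, 1.0, 1.3, 0.8, 1.1, 1.0)

# 1000 bootstrap samples: same size as the data, drawn with replacement,
# taking the mean of each resample
boot_means <- replicate(
  1000,
  mean(sample(delta, size = length(delta), replace = TRUE))
)

# the percentile CI is simply the 2.5% and 97.5% quantiles of those means
quantile(boot_means, probs = c(0.025, 0.975))
```

This is exactly what the infer pipeline does under the hood, just without the tidy bookkeeping.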

As a last step, we create a graph showing the distribution of the calculated means, which is approximately normal in line with the central limit theorem, and plot it along with the 95% confidence interval. The dashed blue lines show the confidence interval from the formula method, and the grey shaded area shows the confidence interval generated by the bootstrap method. The two confidence intervals are very similar in size; the simulated one sits slightly further to the right, meaning both its lower and upper bounds are slightly higher than those from the formula.

In any case, the results tell us that we can be 95% confident that the true mean of delta (the deviation of the actual temperature from the expected temperature) lies between 1.02 and 1.11.

2 Global warming and political views (GSS)

We want to analyze whether the proportion of people who believe the earth is getting warmer differs by political ideology. As usual, we use the proportions from the survey sample to estimate population parameters. The file has 2253 observations on the following 2 variables:

  • party_or_ideology: a factor (categorical) variable with levels Conservative Republican, Liberal Democrat, Mod/Cons Democrat, Mod/Lib Republican
  • response: whether the respondent believes the earth is warming or not, or Don’t know / refuse to answer
global_warming_pew <- read_csv(here::here("data", "global_warming_pew.csv"))

We find that there are three possible responses in the data set:

  • Earth is warming
  • Not warming
  • Don’t know / refuse to answer
global_warming_pew_long <- global_warming_pew %>% 
  count(party_or_ideology, response)

global_warming_pew_long
## # A tibble: 12 × 3
##    party_or_ideology       response                          n
##    <chr>                   <chr>                         <int>
##  1 Conservative Republican Don't know / refuse to answer    45
##  2 Conservative Republican Earth is warming                248
##  3 Conservative Republican Not warming                     450
##  4 Liberal Democrat        Don't know / refuse to answer    23
##  5 Liberal Democrat        Earth is warming                405
##  6 Liberal Democrat        Not warming                      23
##  7 Mod/Cons Democrat       Don't know / refuse to answer    45
##  8 Mod/Cons Democrat       Earth is warming                563
##  9 Mod/Cons Democrat       Not warming                     158
## 10 Mod/Lib Republican      Don't know / refuse to answer    23
## 11 Mod/Lib Republican      Earth is warming                135
## 12 Mod/Lib Republican      Not warming                     135

We will construct 95% confidence intervals to estimate the population percentage who believe that the Earth is warming, for each party or ideology.

We pivot the data set wider to have responses per party, disregarding “Don’t know / refuse to answer”. We then create a 95% confidence interval for the percentage of people believing in global warming within each political standpoint.

global_warming_pew_wide <- global_warming_pew_long %>% 
  pivot_wider(names_from = response, # we pivot wide to have responses per party
              values_from = n) %>% 
select("party_or_ideology", "Earth is warming", "Not warming") # we do not pick "Don't know / refuse to answer" responses

colnames(global_warming_pew_wide)[2:3] <- c("warming", "not_warming") 

global_warming_pew_wide <- global_warming_pew_wide %>%
  mutate(count = warming + not_warming, #calculate CI using formula by hand
         p_hat = warming/count,
         se = sqrt(p_hat * (1 - p_hat)/count),
         z_critical = qt(0.975, count - 1), # note: qt() gives a t critical value; qnorm(0.975) is the usual z
         lower = p_hat - z_critical * se,
         upper = p_hat + z_critical * se)

global_warming_pew_wide
## # A tibble: 4 × 9
##   party_or_ideolo… warming not_warming count p_hat     se z_critical lower upper
##   <chr>              <int>       <int> <int> <dbl>  <dbl>      <dbl> <dbl> <dbl>
## 1 Conservative Re…     248         450   698 0.355 0.0181       1.96 0.320 0.391
## 2 Liberal Democrat     405          23   428 0.946 0.0109       1.97 0.925 0.968
## 3 Mod/Cons Democr…     563         158   721 0.781 0.0154       1.96 0.751 0.811
## 4 Mod/Lib Republi…     135         135   270 0.5   0.0304       1.97 0.440 0.560
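One caveat: the column is named z_critical, but the code above computes it with qt(). For a proportion, the standard large-sample (Wald) interval uses the normal critical value qnorm(0.975) ≈ 1.96; with several hundred observations per group the two are nearly identical. A minimal sketch for a single group, using the Conservative Republican counts from the table above:

```r
# Conservative Republican counts from the table above
warming <- 248
count   <- 698

p_hat <- warming / count
se    <- sqrt(p_hat * (1 - p_hat) / count)

z <- qnorm(0.975) # normal critical value, ~1.96

c(lower = p_hat - z * se,
  upper = p_hat + z * se) # close to the 0.320-0.391 row shown above
```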

We include a visualisation of the final dataset which shows that party allegiance is an indicator of global warming recognition.

global_warming_pew_wide %>% 
  arrange(desc(p_hat)) %>% 
  ggplot(aes(x = p_hat,
             y = fct_reorder(party_or_ideology, p_hat),
             color = party_or_ideology)) +
  geom_point() +
  
#Use geom_errorbarh to depict the confidence interval  
  geom_errorbarh(aes(xmin = lower,
                 xmax = upper)) +
  scale_x_continuous(label = scales::percent) +
  labs(x = "Probability of Answering Earth is warming",
       y = "Party or Ideology") +
  theme_bw() +
  theme(text = element_text(size = 10),
        legend.position = "bottom",
        legend.title = element_blank(),
        plot.margin = unit(c(5.5, 30, 5.5, 5.5), "points"))    #Specify margin to avoid cut-off of the legend

To confirm, let’s run prop.test to test whether the four groups’ population proportions differ significantly.

#Test for significant difference
prop.test(x = global_warming_pew_wide$warming,
          n = global_warming_pew_wide$count,
          conf.level = 0.95)
## 
##  4-sample test for equality of proportions without continuity
##  correction
## 
## data:  global_warming_pew_wide$warming out of global_warming_pew_wide$count
## X-squared = 504, df = 3, p-value <2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 
##  0.355  0.946  0.781  0.500

The p-value is very small, so we can reject the null hypothesis that the four groups’ population proportions are equal.
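prop.test only tells us that at least one proportion differs. To see which pairs of groups differ, base R’s pairwise.prop.test() runs all pairwise two-sample proportion tests with a multiple-testing correction; a sketch using the counts from the table above:

```r
# "Earth is warming" counts and totals (warming + not warming) per group,
# taken from the table above
warming <- c(248, 405, 563, 135)
count   <- c(698, 428, 721, 270)
names(warming) <- c("Cons Rep", "Lib Dem", "Mod/Cons Dem", "Mod/Lib Rep")

# all pairwise two-sample proportion tests, Holm-adjusted p-values
pairwise.prop.test(x = warming, n = count)
```

Given how far apart the sample proportions are (0.355 to 0.946), every pairwise difference comes out significant here.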

Given our samples and the reading on The challenging politics of climate change, we infer a strong association between party ideology and belief in global warming. The further we move toward the centre left of the political landscape, the more people acknowledge that the planet is getting warmer. This is a recurring theme in US politics, as Republican leaders tend to downplay the importance of climate change, with their voters following suit.

What we need to point out, though, is the difference in sample sizes. Although we would expect Liberal Democrats to be more sensitive to global warming, theirs is a much smaller sample than those of Conservative Republicans and Mod/Cons Democrats. The issue is even more pronounced for Mod/Lib Republicans. We therefore have the clearest picture for Conservative Republicans and Mod/Cons Democrats: the first group largely disregards global warming, while the latter recognises it as an important problem.

3 Biden’s Approval Margins

fivethirtyeight.com has detailed data on all polls that track the president’s approval. We load the data for Biden’s approval ratings and fix the dates so they parse correctly.

# Import approval polls data directly off fivethirtyeight website
approval_polllist <- read_csv('https://projects.fivethirtyeight.com/biden-approval-data/approval_polllist.csv') 

# Use `lubridate` to fix dates, as they are given as characters.
approval_polllist <- approval_polllist %>%
  mutate(modeldate = mdy(modeldate),
         startdate = mdy(startdate),
         enddate = mdy(enddate),
         createddate = mdy(createddate),
         timestamp = hms(timestamp))

glimpse(approval_polllist)
## Rows: 1,819
## Columns: 22
## $ president           <chr> "Joseph R. Biden Jr.", "Joseph R. Biden Jr.", "Jos…
## $ subgroup            <chr> "All polls", "All polls", "All polls", "All polls"…
## $ modeldate           <date> 2021-10-01, 2021-10-01, 2021-10-01, 2021-10-01, 2…
## $ startdate           <date> 2021-01-19, 2021-01-19, 2021-01-20, 2021-01-20, 2…
## $ enddate             <date> 2021-01-21, 2021-01-21, 2021-01-21, 2021-01-21, 2…
## $ pollster            <chr> "Rasmussen Reports/Pulse Opinion Research", "Morni…
## $ grade               <chr> "B", "B", "B+", "B", "B-", "B", "B+", "B-", "B", "…
## $ samplesize          <dbl> 1500, 15000, 1516, 1993, 1115, 15000, 941, 1200, 1…
## $ population          <chr> "lv", "a", "a", "rv", "a", "a", "rv", "rv", "lv", …
## $ weight              <dbl> 0.3382, 0.2594, 1.2454, 0.0930, 1.1014, 0.2333, 1.…
## $ influence           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ approve             <dbl> 48, 50, 45, 56, 55, 51, 63, 58, 48, 52, 53, 55, 48…
## $ disapprove          <dbl> 45, 28, 28, 31, 32, 28, 37, 32, 47, 29, 29, 33, 47…
## $ adjusted_approve    <dbl> 50.5, 48.6, 46.5, 54.6, 53.9, 49.6, 58.7, 56.9, 50…
## $ adjusted_disapprove <dbl> 38.8, 31.3, 28.4, 34.3, 32.9, 31.3, 38.0, 33.1, 40…
## $ multiversions       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tracking            <lgl> TRUE, TRUE, NA, NA, NA, TRUE, NA, NA, TRUE, TRUE, …
## $ url                 <chr> "https://www.rasmussenreports.com/public_content/p…
## $ poll_id             <dbl> 74247, 74272, 74327, 74246, 74248, 74273, 74256, 7…
## $ question_id         <dbl> 139395, 139491, 139570, 139394, 139404, 139492, 13…
## $ createddate         <date> 2021-01-22, 2021-01-28, 2021-02-02, 2021-01-22, 2…
## $ timestamp           <Period> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

3.1 Create a plot

As a next step, we calculate the average net approval rate (approve - disapprove) for each week since he got into office. We plot the net approval, along with its 95% confidence interval. There are various dates given for each poll, so we use enddate, i.e., the date the poll ended.


We produce a table called weekly_approval with Biden’s average net approval rating for every week he has been president, then calculate the 95% CIs. Before that, we check the different subgroups of polls in order to avoid duplicates.

approval_polllist %>% 
  count(subgroup)
## # A tibble: 3 × 2
##   subgroup      n
##   <chr>     <int>
## 1 Adults      519
## 2 All polls   864
## 3 Voters      436
# Calculate the weekly net approval rate
weekly_approval <- approval_polllist %>%

#filter the data: the same poll can appear under several subgroups, so we keep only the "All polls" subgroup, which has the most samples  
  filter(subgroup == "All polls") %>% 
  
#select only relevant data
  select(president, enddate, samplesize, approve, disapprove, poll_id, question_id) %>% 
  
  mutate(net_approval = approve - disapprove, #calculate net approval
         week = week(enddate)) %>%  #get week of the year (data is only for 2021)
  
#calculate weekly mean and confidence interval
  group_by(week) %>% 
  summarise(mean_net_approval = mean(net_approval),
            sd_net_approval = sd(net_approval),
            count = n(),
            t_critical = qt(0.975, count - 1),
            se_net_approval = sd_net_approval/sqrt(count),
            margin_of_error = t_critical * se_net_approval,
            net_approval_low = mean_net_approval - margin_of_error,
            net_approval_high = mean_net_approval + margin_of_error)

head(weekly_approval)
## # A tibble: 6 × 9
##    week mean_net_approval sd_net_approval count t_critical se_net_approval
##   <dbl>             <dbl>           <dbl> <int>      <dbl>           <dbl>
## 1     3              18              8.89     5       2.78            3.97
## 2     4              18.3            9.44    23       2.07            1.97
## 3     5              16.6            7.70    25       2.06            1.54
## 4     6              16.7            8.37    18       2.11            1.97
## 5     7              16.4            7.57    24       2.07            1.54
## 6     8              14.9            7.86    24       2.07            1.60
## # … with 3 more variables: margin_of_error <dbl>, net_approval_low <dbl>,
## #   net_approval_high <dbl>

Next, we plot the net approvals and find a diminishing trend. This could be explained by the recent political disaster in Afghanistan and the fact that the US needs to borrow more money to meet its outstanding debt payments.

#plot the weekly net approval
ggplot(weekly_approval, aes(x = week, y = mean_net_approval)) +
  geom_point(color = 'orangered') +
  geom_smooth(se = F) +
  geom_line(color = 'orangered',
            size = 0.2,
            alpha = 0.6) +
  
#add a horizontal line at 0 (approval = disapproval)
  geom_hline(yintercept = 0,
             color = 'orange',
             size = 2) +
  
#use geom_ribbon to depict confidence intervals
  geom_ribbon(aes(ymin = net_approval_low,
                  ymax = net_approval_high),
              fill = 'black',
              color = 'red',
              alpha = 0.1,
              size = 0.2) +
  
#adjust axis breaks and gridlines
  scale_y_continuous(breaks = seq(-10, 20, by = 2.5),
                     minor_breaks = seq(-10, 20, 1.25)) +
  scale_x_continuous(breaks = seq(3, 39, by = 12),
                     minor_breaks = seq(3, 39, 6)) +
  
#some formatting
  ggplot2::annotate("text",
           x = 21, y = 30,
           label = "2021",
           size = 4)+
  theme_bw() +
  theme(title = element_text(size = 12),
        panel.border = element_blank()) +
  coord_cartesian(ylim = c(-12, 30))+
  labs(title = 'Estimating Approval Margin (approve-disapprove) for Joe Biden',
       subtitle = "Weekly average of all polls",
       x = "Week of the year", 
       y = "Average Approval Margin (Approve - Disapprove)") +
  NULL

3.2 Compare Confidence Intervals

From the graph we can clearly see that the confidence interval for week 25 is considerably smaller than for week 3, indicated by the grey area around the line and its red borders. Looking at the weekly_approval table, the boundaries of the confidence interval for week 25 are (9.79, 13.47), whereas they are (6.96, 29.03) for week 3. This difference is explained by the sample sizes: for week 3 we only have 5 polls, whereas for week 25 we have 29. The margin of error is \(t \frac{s}{\sqrt{n}}\), where t is the critical t-value from the t distribution, which increases as the degrees of freedom (\(n-1\)) decrease; in addition, the standard error \(\frac{s}{\sqrt{n}}\) is larger for smaller \(n\). The confidence interval therefore narrows as the sample size grows, which is exactly what we see when comparing weeks 3 and 25.
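To see the sample-size effect in isolation, here is a quick sketch that holds the standard deviation fixed at the week-3 value from the table (8.89) and varies only n; reusing the same s for both weeks is an illustrative assumption, since week 25’s actual standard deviation is not shown in the table excerpt:

```r
# margin of error: t critical value times the standard error s / sqrt(n)
moe <- function(s, n) qt(0.975, df = n - 1) * s / sqrt(n)

moe(8.89, 5)  # week 3: 5 polls, roughly an 11-point margin
moe(8.89, 29) # 29 polls: a much narrower margin
```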

4 Challenge 1: Excess rentals in TfL bike sharing

We get the latest TfL data by running the code below, creating a dataset of how many bikes were hired each day.

url <- "https://data.london.gov.uk/download/number-bicycle-hires/ac29363e-e0cb-47cc-a97a-e216d900a6b0/tfl-daily-cycle-hires.xlsx"

# Download TFL data to temporary file
httr::GET(url, write_disk(bike.temp <- tempfile(fileext = ".xlsx")))
## Response [https://airdrive-secure.s3-eu-west-1.amazonaws.com/london/dataset/number-bicycle-hires/2021-09-23T12%3A52%3A20/tfl-daily-cycle-hires.xlsx?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJJDIMAIVZJDICKHA%2F20211003%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20211003T115004Z&X-Amz-Expires=300&X-Amz-Signature=a9801dd2f763ac4d67b47bd43a3ea72f6fd47b14eb106fcbf736994fb3e7bc80&X-Amz-SignedHeaders=host]
##   Date: 2021-10-03 11:50
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 174 kB
## <ON DISK>  /var/folders/gb/wrf5tvb121q80nhyf6fzxrc80000gn/T//RtmpsFpPZm/file592278694.xlsx
# Use read_excel to read it as dataframe
bike0 <- read_excel(bike.temp,
                   sheet = "Data",
                   range = cell_cols("A:B"))

# change dates to get year, month, and week
bike <- bike0 %>% 
  clean_names() %>% 
  rename (bikes_hired = number_of_bicycle_hires) %>% 
  mutate (year = year(day),
          month = lubridate::month(day, label = TRUE),
          week = isoweek(day))

We can easily create a facet grid that plots bikes hired by month and year.

First, let’s reproduce that graph, updating it to the most recent dataset.

#Create a density plot
bike %>% 
  filter(year >= 2015) %>% 
  ggplot(aes(x = bikes_hired)) +
  geom_density() +
  
#Facet by year and month
  facet_grid(rows = vars(year),
             cols = vars(month)) +
  
#Set the x-lab into units of K(thousand)
  scale_x_continuous(label = scales::label_number(suffix = "K",
                                                  scale = 0.001)) +
  
#Some formatting
  labs(x = "Bike Rentals",
       y = NULL,
       title = "Distribution of bikes hired per month")+
  theme_bw() +
  theme(text = element_text(size = 11),
        strip.background = element_rect(color = NA,
                                        fill = NA),
        panel.border = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_text(size = 7),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank()) +
  NULL

In May and June 2020, the distributions of daily bike rentals are visually flatter and more uniform than in other months. Compared to the surrounding months, May and June have several observations of low daily rental volumes, likely due to the Covid-19 stay-at-home orders issued by the government; these restrictions were phased out in June, with non-essential shops and schools reopening. We believe this is reflected in the data by the more even spread of daily rental volume than in other months.

For instance, during restricted periods we expect daily bike rental figures were low, between 20-40k, because leisure rentals were effectively banned and only key workers, or those who could not work from home, used bikes to commute, while other means of public transport were restricted or suspended. Once restrictions were relaxed, however, we expect large numbers of bikes were rented by households wanting to enjoy their new-found freedom outside their homes. Because the rental figures were spread fairly evenly across the distribution rather than concentrated, the distribution shows little peakedness, making it noticeably flatter than in other months.

Simply put, the disruption to daily life meant there was a wider spread of daily bike rentals over these two months, so neither month’s distribution resembles those of other months.

To better understand the excess rental phenomena, we will reproduce the two graphs below, one on monthly change and the other on weekly change from the expected level of monthly or weekly rentals. The two grey shaded rectangles in the second graph correspond to Q2 (weeks 14-26) and Q4 (weeks 40-52).

The expected number of rentals per week or month is calculated from data between 2016-2019, and we see how each week or month of 2020-2021 compares to the expected rentals, using the calculation excess_rentals = actual_rentals - expected_rentals.

To decide whether we use mean or median to represent expected level of bike rentals, let’s first graph a box plot to see how daily data are distributed.

#Create a boxplot to see the pattern of distribution in 2016-2019
bike %>% 
  filter(year >= 2016, year <= 2019) %>% 
  ggplot(aes(y = bikes_hired)) +
  facet_wrap(~year,
             nrow = 2) +
  geom_boxplot() +
  NULL

We can see that the distributions of daily rentals in 2016-2019 are not very skewed and there are few outliers in the dataset. Since the mean uses all the data and is not unduly distorted here, we decide to use the monthly or weekly mean as our expected level of bike rentals.
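The intuition behind this choice can be sketched on toy data (the numbers below are made up): with a roughly symmetric sample the mean and median nearly coincide, while a single extreme day drags the mean away from the median:

```r
symmetric <- c(28, 29, 30, 30, 31, 32) # toy daily rentals (in thousands), no skew
skewed    <- c(28, 29, 30, 30, 31, 90) # one extreme day

mean(symmetric) - median(symmetric) # zero: mean and median coincide
mean(skewed)    - median(skewed)    # mean pulled well above the median
```

With little skew and few outliers, as in our boxplots, the mean is a safe and efficient choice.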

We first work with our daily data to build a monthly summary of bike rentals and apply some transformations for later plotting.

#Summarize mean of rentals by month
bike_monthly <- bike %>% 
  filter(year >= 2016) %>% 
  group_by(year, month) %>% 
  summarize(bikes_hired_monthly_mean = mean(bikes_hired))

#Calculate the expected monthly rentals, i.e. the mean of monthly mean in 2016-2019
bike_monthly_expected <- bike_monthly %>% 
  filter(year <= 2019) %>% 
  group_by(month) %>% 
  summarize(bikes_hired_monthly_expected = mean(bikes_hired_monthly_mean))

#Incorporate the expected monthly rentals to bike_monthly
bike_monthly <- left_join(bike_monthly,
                          bike_monthly_expected, 
                          by = "month")

#Create two new columns for excess(+/-) rentals per month compared with expected levels
bike_monthly <- bike_monthly %>% 
  mutate(bikes_hired_monthly_excess = pmax(bikes_hired_monthly_mean - bikes_hired_monthly_expected, 0),
         bikes_hired_monthly_shortage = pmax(bikes_hired_monthly_expected - bikes_hired_monthly_mean, 0))

head(bike_monthly)
## # A tibble: 6 × 6
## # Groups:   year [1]
##    year month bikes_hired_monthly_mean bikes_hired_monthly_… bikes_hired_monthl…
##   <dbl> <ord>                    <dbl>                 <dbl>               <dbl>
## 1  2016 Jan                     18914.                20617.                   0
## 2  2016 Feb                     20608.                22062.                   0
## 3  2016 Mar                     21435                 23237.                   0
## 4  2016 Apr                     25444.                28299.                   0
## 5  2016 May                     32699.                33270.                   0
## 6  2016 Jun                     32108.                35413.                   0
## # … with 1 more variable: bikes_hired_monthly_shortage <dbl>

Now we can reproduce the first graph.

#Lines for actual monthly rentals and expected monthly rentals
ggplot(bike_monthly, 
       aes(x = month)) +
  geom_line(aes(y = bikes_hired_monthly_mean,
                group = 1)) +
  geom_line(aes(y = bikes_hired_monthly_expected,
                group = 1),
            size = 1.5,
            color = "blue") +
  facet_wrap(~ year) +
  
#Use geom_ribbon for excess and shortage compared with expected levels
  geom_ribbon(aes(ymin = bikes_hired_monthly_expected,
                  ymax = bikes_hired_monthly_expected + bikes_hired_monthly_excess,
                  group = 1),
              fill = "green4",
              alpha = 0.25) +
  geom_ribbon(aes(ymin = bikes_hired_monthly_expected - bikes_hired_monthly_shortage,
                  ymax = bikes_hired_monthly_expected,
                  group = 1),
              fill = "red3",
              alpha = 0.25) +
  
#Some formatting
  scale_y_continuous(labels = scales::comma,
                     limits = c(13000, NA)) +
  labs(x = "Month",
       y = "Bike Rentals",
       title = "Monthly changes in TfL bike rentals",
       subtitle = "Change from monthly average shown in blue \nand calculated between 2016-2019",
       caption = "Source: TfL, London Data Store") +
  theme_bw() +
  theme(text = element_text(size = 14),
        strip.background = element_rect(color = NA,
                                        fill = NA),
        panel.border = element_blank()) +
  NULL

The graph above depicts expected and actual monthly bike rentals across London. The blue line shows the expected average daily number of bikes rented in a given month and the black line shows the actual average. The green and red fills show the excess and shortfall, respectively, relative to the expected figure.

  • The graph shows that in 2016, 2017 and 2019 actual usage was approximately in line with expectation, with months distributed roughly evenly between marginally above and marginally below the expected level.

  • For 2018, June and July are considerably above expectation, but this is unsurprising given the weather reports for those months. The summer of 2018 was the joint hottest ever recorded in the UK, with a peak temperature of 35.3°C, so a large excess of rentals is to be expected.

  • As expected, 2020 and 2021 deviate considerably from expectation several times, almost certainly due to the Covid-19 pandemic. In March and April 2020, actual rental figures were well below expectation, likely because of the stay-at-home order issued by the UK government during those months. In May and June the daily rental average then jumped sharply from under 20k to over 35k, this time likely due to the relaxation of some restrictions, as well as key workers and those who could not work from home commuting by bike while other public transport provision was reduced or cancelled. September 2020 was also well above expectation, likely due to the imposition of the “Rule of 6” for indoor and outdoor gatherings, as well as behavioural changes: people sought to spend more time outdoors after having been confined to their homes for many weeks. The rest of 2020 was fairly in line with expectation despite a circuit-breaker lockdown, though this may be because the expected level of daily rentals already drops sharply towards the end of the year.

  • So far in 2021, only January and February have been well below expectation, likely due to the third national lockdown, which required households to remain at home. Since restrictions were relaxed, average daily rentals have remained broadly in line with expectation.

We then do the same with weekly data, but calculate the excess(+/-) rentals as a percentage of the expected weekly rentals.

Before building the dataset, we first need to check whether our previous steps have correctly categorized the data for weeks 1, 52 and 53. The first few days of a calendar year often belong to week 52 or 53 of the previous year, and the last few days of a year to week 1 of the next year. If such days keep their calendar-year label, a “week 52” can end up containing 10 days of data, which is clearly impossible.

#Display all week 1, 52 and 53 data
bike %>%
  filter(year >= 2016,
         week %in% c(1, 52, 53) )
## # A tibble: 87 × 5
##    day                 bikes_hired  year month  week
##    <dttm>                    <dbl> <dbl> <ord> <dbl>
##  1 2016-01-01 00:00:00        9922  2016 Jan      53
##  2 2016-01-02 00:00:00        7246  2016 Jan      53
##  3 2016-01-03 00:00:00        4894  2016 Jan      53
##  4 2016-01-04 00:00:00       20644  2016 Jan       1
##  5 2016-01-05 00:00:00       22934  2016 Jan       1
##  6 2016-01-06 00:00:00       23199  2016 Jan       1
##  7 2016-01-07 00:00:00       18225  2016 Jan       1
##  8 2016-01-08 00:00:00       20948  2016 Jan       1
##  9 2016-01-09 00:00:00       11674  2016 Jan       1
## 10 2016-01-10 00:00:00       14447  2016 Jan       1
## # … with 77 more rows

Our concern is confirmed: the first 3 days of 2016 are categorized as week 53, but that is week 53 of 2015, not of 2016. The same issue occurs in every year of the current dataset. To rectify this, we need to reassign any mislabelled days in weeks 1, 52 and 53 to the correct year.
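
As an aside, lubridate’s ISO-8601 helpers handle this boundary directly: isoyear() assigns a date to the year that owns its ISO week, so it could replace a manual relabelling (a sketch, assuming the lubridate package is available):

```r
library(lubridate)

# 2016-01-01 is a Friday; its ISO week runs Mon 2015-12-28 to Sun 2016-01-03,
# so it belongs to week 53 of ISO year 2015, not 2016.
d <- ymd("2016-01-01")
isoweek(d) # 53
isoyear(d) # 2015
```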

#Create a new dataset to avoid overwriting
bike_adjust <- bike

#For all week 52 and 53 data in January, move their year label to the previous year
bike_adjust$year <- if_else(bike_adjust$week >= 52 & bike_adjust$month == "Jan",
                            bike_adjust$year - 1,
                            bike_adjust$year)

#For all week 1 data in December, move their year label to the next year
bike_adjust$year <- if_else(bike_adjust$week == 1 & bike_adjust$month == "Dec",
                            bike_adjust$year + 1,
                            bike_adjust$year)

#Display all week 1, 52 and 53 data
bike_adjust %>%
  filter(year >= 2016,
         week %in% c(1, 52, 53) )
## # A tibble: 84 × 5
##    day                 bikes_hired  year month  week
##    <dttm>                    <dbl> <dbl> <ord> <dbl>
##  1 2016-01-04 00:00:00       20644  2016 Jan       1
##  2 2016-01-05 00:00:00       22934  2016 Jan       1
##  3 2016-01-06 00:00:00       23199  2016 Jan       1
##  4 2016-01-07 00:00:00       18225  2016 Jan       1
##  5 2016-01-08 00:00:00       20948  2016 Jan       1
##  6 2016-01-09 00:00:00       11674  2016 Jan       1
##  7 2016-01-10 00:00:00       14447  2016 Jan       1
##  8 2016-12-26 00:00:00       11637  2016 Dec      52
##  9 2016-12-27 00:00:00       11003  2016 Dec      52
## 10 2016-12-28 00:00:00       12530  2016 Dec      52
## # … with 74 more rows

We have now taken care of the mislabeling issue. To make sure of this, let’s look at how many days are in each week of each year.

#Count how many days are in each week and order the table by the count
bike_adjust %>% 
  filter(year >= 2016) %>% 
  group_by(year, week) %>% 
  summarize(number_of_days = n()) %>% #n() counts the rows (i.e. days) in each year-week group
  arrange(number_of_days)
## # A tibble: 296 × 3
## # Groups:   year [6]
##     year  week number_of_days
##    <dbl> <dbl>          <int>
##  1  2021    35              2
##  2  2016     1              7
##  3  2016     2              7
##  4  2016     3              7
##  5  2016     4              7
##  6  2016     5              7
##  7  2016     6              7
##  8  2016     7              7
##  9  2016     8              7
## 10  2016     9              7
## # … with 286 more rows

Except for week 35 of 2021, which contains our most recent (and still incomplete) data, all weeks in the bike_adjust dataset have 7 days. We can now proceed with further processing and visualization.

Note that we only have one instance of week 53 in the current dataset, in 2020. There is no week 53 in 2016-2019 (apart from the first 3 days of 2016, which we have already reassigned to 2015), so there is no expected weekly rental figure for week 53. To address this anomaly, we simply discard the week 53 data from the new dataset to avoid an NA value.
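
The NA in question comes from the left join: a left-hand week with no match in the expected-values table gets NA, as a toy join illustrates (assuming dplyr is loaded; the numbers are illustrative):

```r
library(dplyr)

actual   <- tibble(week = c(52, 53), bikes_hired_weekly_mean = c(11000, 9000))
expected <- tibble(week = 52, bikes_hired_weekly_expected = 20000)

# Week 53 has no match on the right, so its expected value is NA,
# and any percentage change computed from it would also be NA.
left_join(actual, expected, by = "week")
```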

#Summarize mean of rentals by week
bike_weekly <- bike_adjust %>% 
  filter(year >= 2016, week != 53) %>% 
  group_by(year, week) %>% 
  summarize(bikes_hired_weekly_mean = mean(bikes_hired))

#Calculate the expected weekly rentals, i.e. the mean of weekly mean in 2016-2019
bike_weekly_expected <- bike_weekly %>% 
  filter(year <= 2019) %>% 
  group_by(week) %>% 
  summarize(bikes_hired_weekly_expected = mean(bikes_hired_weekly_mean))

#Incorporate the expected weekly rentals to bike_weekly
bike_weekly <- left_join(bike_weekly,
                         bike_weekly_expected,
                         by = "week")

#Create two new columns for percentage excess(+/-) rentals per week compared with expected levels
bike_weekly <- bike_weekly %>% 
  mutate(bikes_hired_weekly_diff = (bikes_hired_weekly_mean - bikes_hired_weekly_expected)/bikes_hired_weekly_expected,
         bikes_hired_weekly_excess = pmax(bikes_hired_weekly_diff, 0),
         bikes_hired_weekly_shortage = pmin(bikes_hired_weekly_diff, 0))

head(bike_weekly)
## # A tibble: 6 × 7
## # Groups:   year [1]
##    year  week bikes_hired_weekly_mean bikes_hired_weekly_ex… bikes_hired_weekly…
##   <dbl> <dbl>                   <dbl>                  <dbl>               <dbl>
## 1  2016     1                  18867.                 17388.             0.0851 
## 2  2016     2                  20255.                 22056.            -0.0817 
## 3  2016     3                  20937.                 21892.            -0.0436 
## 4  2016     4                  20550.                 21470.            -0.0428 
## 5  2016     5                  21230                  21194.             0.00169
## 6  2016     6                  19896.                 20226.            -0.0163 
## # … with 2 more variables: bikes_hired_weekly_excess <dbl>,
## #   bikes_hired_weekly_shortage <dbl>

Now we can reproduce the second graph.

#Line for the percentage change compared with expected weekly bike rentals
ggplot(bike_weekly, aes(x = week)) +
  geom_line(aes(y = bikes_hired_weekly_diff)) +
  facet_wrap(~ year) +

#Use geom_ribbon for percentage excess and shortage compared with expected levels   
  geom_ribbon(aes(ymin = 0,
                  ymax = bikes_hired_weekly_excess),
              fill = "green4",
              alpha = 0.25) +
  geom_ribbon(aes(ymin = bikes_hired_weekly_shortage,
                  ymax = 0),
              fill = "red3",
              alpha = 0.25) +
  
#Create the two grey shaded rectangles for Q2 and Q4
#geom_rect inherits the plot data, so it draws one rectangle per row and the alphas stack; annotate(geom = "rect", ...) draws each rectangle only once
  ggplot2::annotate(geom = "rect",
           xmin = 13, xmax = 26,
           ymin = -Inf, ymax = Inf,
           fill = "grey",
           alpha = 0.25) +
  ggplot2::annotate(geom = "rect",
           xmin = 39, xmax = 53,
           ymin = -Inf, ymax = Inf,
           fill = "grey",
           alpha = 0.25) +
  
#Use geom_rug for the small indicators at the bottom, with two differently sliced dataset from bike_weekly
  geom_rug(data = filter(bike_weekly, bikes_hired_weekly_excess != 0),
           color = "green4",
           sides = "b") +
  geom_rug(data = filter(bike_weekly, bikes_hired_weekly_shortage != 0),
           color = "red3",
           sides = "b") +
  
#Some formatting
  scale_x_continuous(breaks = c(13, 26, 39, 52)) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "week",
       y = NULL,
       title = "Weekly changes in TfL bike rentals",
       subtitle = "% change from weekly averages \ncalculated between 2016-2019",
       caption = "Source: TfL, London Data Store") +
  theme_bw() +
  theme(text = element_text(size = 14),
        strip.background = element_rect(color = NA,
                                        fill = NA),
        panel.border = element_blank()) +
  NULL

The graph above maps weekly changes in bike rentals across London from 2016 to mid-2021.

  • We can infer from the figure that, from 2016 to 2019, the greatest volatility in bike usage is generally observed at the beginning and towards the end of the calendar year. There is likely a seasonal element to this: less usage as the weather worsens, or greater usage if there is a warm, sunny spell during the colder months. It is also worth noting that, because average daily figures are lower in these parts of the year, a fixed absolute change in usage represents a larger percentage change. For example, a change of 2,000 in the daily average for a week in January registers as almost a 10% change on the weekly chart above, whereas the same change for a week in July registers as roughly 5%, appearing as a smaller change despite the raw difference being identical.

  • In line with expectation, 2020 was the most volatile year of the measurement period in terms of weekly changes. The ±50% level was exceeded twice in 2020, which is remarkable given that it was never exceeded in the observed data before then. The level has been exceeded once more since: in spring 2021 a weekly rental rate was over 50% above the expected level, likely a result of the (hopefully final) relaxation of Covid restrictions and its impact on user behaviour.

  • Seasonal patterns are not clear in this figure: there is no period of the calendar year in which usage consistently increases or decreases across all observed years. For example, from week 26 to 39 there is a noticeable increase in 2016 but a clear decrease in 2017, and other years show contradictory changes over those weeks.
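
The baseline effect described in the first bullet can be checked with back-of-the-envelope arithmetic (illustrative values, not taken from the dataset):

```r
# The same absolute change registers as a different percentage change
# depending on the weekly baseline it is divided by:
change <- 2000
change / 21000 # January-like baseline: ~0.095, almost 10%
change / 41000 # July-like baseline:    ~0.049, roughly 5%
```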


5 Challenge 2: How have the CPI and its components changed over the last few years?

In this challenge, we will do the following:

  1. We can find the CPI components at FRED; we scrape the FRED website and pull all of the CPI components into a vector.
  2. Once we have a vector of components, we pass it to tidyquant::tq_get(get = "economic.data", from = "2000-01-01") to get all data since January 1, 2000.
  3. Since the data we download is an index with various starting dates, we need to calculate the yearly, or 12-month, change. To do this we use the lag function; specifically, year_change = value/lag(value, 12) - 1, which compares the current month’s value with that of 12 months earlier, lag(value, 12).
  4. After that, we order the components so that the higher the yearly change, the earlier that component appears, making sure that All Items CPI (CPIAUCSL) appears first.
  5. To make the graph easier to read, we also take care of some specific formatting.
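
Step 3’s lag() calculation can be sketched on a toy index series (assuming dplyr is loaded; the values are illustrative):

```r
library(dplyr)

# A flat index of 100 for 12 months, then 108 in month 13:
idx <- tibble(value = c(rep(100, 12), 108)) %>%
  mutate(year_change = value / lag(value, 12) - 1)

# The first 12 rows have no value 12 months back, so year_change is NA;
# row 13 compares 108 with 100, giving 0.08, i.e. an 8% yearly change.
tail(idx, 1)
```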

Having completed these steps, we should get the graph below.

We first scrape the data from the FRED website with the following code.

url <- "https://fredaccount.stlouisfed.org/public/datalist/843"
library(rvest)
library(purrr)
library(htmlTable)

tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")

CPI_components <- purrr::map(tables, . %>% 
             rvest::html_table(fill=TRUE)%>% 
             janitor::clean_names())

cpi_components <- CPI_components[[2]] %>% 
  rename(component = series_id)

table <- cpi_components %>% 
  select(component) %>% 
  pull() %>% 
  tidyquant::tq_get(get = "economic.data", from =  "2000-01-01") %>% 
  rename(component = symbol) %>% 
  left_join(cpi_components, by="component") %>% 
  select(title, component, date, price) %>% 
  group_by(component) %>% #compute the lag within each component, not across the stacked series
  mutate(cpi_year_change = price/lag(price, 12) - 1) %>% 
  ungroup()

glimpse(table)
## Rows: 12,704
## Columns: 5
## $ title           <chr> "Consumer Price Index for All Urban Consumers: Airline…
## $ component       <chr> "CUSR0000SETG01", "CUSR0000SETG01", "CUSR0000SETG01", …
## $ date            <date> 2000-01-01, 2000-02-01, 2000-03-01, 2000-04-01, 2000-…
## $ price           <dbl> 222, 230, 241, 240, 241, 245, 247, 250, 245, 236, 238,…
## $ cpi_year_change <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.0775…

We continue by modifying the dataframe to make it ready to plot, and name this new dataframe table_to_plot.

table_to_plot <- table %>%
  filter(year(date) >= 2016) %>% 
  mutate(Color = if_else(cpi_year_change >= 0, "brown2", "steelblue1")) %>% #change color according to increase or decrease
  mutate(title = substr(title, 47, nchar(title)), #remove the common prefix shared by every component title
         title = removeWords(title, "in U.S. City Average"), #removeWords() comes from the tm package
         title = fct_relevel(
           fct_reorder(title, cpi_year_change, max, .desc=TRUE), #order components according to yearly change
           "All Items ", after = 0)) #set "All Items" as first component

Having done this, we are now able to plot the data and get a graph similar to the one shown above.

ggplot(data = table_to_plot,
       aes(x = date, y = cpi_year_change, color = Color)) +
  geom_point(size = 1, alpha = 0.6) +
  geom_smooth(se = FALSE, size = 1, color = "darkgrey") + #add smooth line to show trend
  facet_wrap(~title, scales = "free", nrow = 7) +
  scale_color_identity() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  labs(title = "Yearly change of US CPI (All Items) and its components",
       subtitle = "YoY change being <span style = 'color: brown2;'>positive</span> or <span style = 'color: steelblue1;'>negative</span><br>Jan 2016 to Aug 2021", #code in HTML
       caption = "Data from St. Louis FRED\nhttps://fredaccount.stlouisfed.org/public/datalist/843",
       y = "YoY % Change",
       x = NULL) +
  theme_bw() +
  theme(text = element_text(size = 12),
        plot.subtitle = element_markdown(), #set to markdown so that R recognizes that code written in HTML
        plot.title = element_text(face = "bold")) +
  NULL

This graph contains 49 different categories, but misses some very important ones such as Education or Medical Care. Since we are not able to get data for these categories with the method above, we import the provided CSV file cpi_data.csv and recreate the graph with the additional categories included.

cpi_data <- read.csv(here::here("data", "cpi_data.csv"))

cpi_data2 <- cpi_data %>% 
  select(title, component, date, value) %>% 
  group_by(component) %>% #compute the lag within each component, not across the stacked series
  mutate(date = ymd(date),
         cpi_year_change = value/lag(value, 12) - 1) %>% 
  ungroup() %>% 
  filter(year(date) >= 2016) %>%
  mutate(Color = ifelse(cpi_year_change >= 0, "brown2", "steelblue1"),
         title = substr(title, 47, nchar(title)), #remove the common prefix shared by every component title
         title = removeWords(title, "in U.S. City Average"), #removeWords() comes from the tm package
         title = fct_relevel(
           fct_reorder(title, cpi_year_change, max, .desc=TRUE), #order components according to yearly change
           "All Items ", after = 0)) #set "All Items" as first component

ggplot(data = cpi_data2,
       aes(x = date, y = cpi_year_change, color = Color)) +
  geom_point(size = 1, alpha = 0.6) +
  geom_smooth(se = FALSE, size = 1, color = "darkgrey") + #add smooth line to show trend
  facet_wrap(~title, scales = "free", nrow = 7) +
  scale_color_identity() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  labs(title = "Yearly change of US CPI (All Items) and its components",
       subtitle = "YoY change being <span style = 'color: brown2;'>positive</span> or <span style = 'color: steelblue1;'>negative</span><br>Jan 2016 to Aug 2021", #code in HTML
       caption = "Data from St. Louis FRED\nhttps://fredaccount.stlouisfed.org/public/datalist/843",
       y = "YoY % Change",
       x = NULL) +
  theme_bw() +
  theme(text = element_text(size = 12),
        plot.subtitle = element_markdown(), #set to markdown so that R recognizes that code written in HTML
        plot.title = element_text(face = "bold")) +
  NULL

This graph is fine, but perhaps has too many sub-categories. Following the relative importance of components in the Consumer Price Indexes: U.S. city average, December 2020, we can choose a smaller subset and list only the major categories (Housing, Transportation, Food and Beverages, Medical Care, Education and Communication, Recreation, and Apparel), sorted by their relative importance.

#create vector to store major categories sorted according to relative importance
CPI_major_categories <- c("All Items ", "Housing ","Transportation ", "Food and Beverages ", "Medical Care ", "Education and Communication ", "Recreation ", "Apparel ")

table_major_categories <- cpi_data2 %>% 
  filter(title %in% CPI_major_categories) %>% #filter for those categories listed above
  mutate(title = factor(title,
                        levels = CPI_major_categories,
                        labels = CPI_major_categories))

ggplot(table_major_categories,
       aes(x = date, y = cpi_year_change, color = Color)) +
  geom_point(size = 1, alpha = 0.6) +
  geom_smooth(se = FALSE, size = 1, color = "darkgrey") + #add smooth line to show trend
  facet_wrap(~title, scales = "free", nrow = 2) +
  scale_color_identity() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  labs(title = "Yearly change of US CPI (All Items) and its major components according to their relative importance",
       subtitle = "YoY change being <span style = 'color: brown2;'>positive</span> or <span style = 'color: steelblue1;'>negative</span><br>Jan 2016 to Aug 2021", #code in HTML
       caption = "Data from St. Louis FRED\nhttps://fredaccount.stlouisfed.org/public/datalist/843",
       y = "YoY % Change",
       x = NULL) +
  theme_bw() +
  theme(text = element_text(size = 10),
        plot.subtitle = element_markdown(), #set to markdown so that R recognizes that code written in HTML
        plot.title = element_text(face = "bold")) +
  NULL

6 Details

  • Who did you collaborate with: team members of Study Group A14
  • Approximately how much time did you spend on this problem set: 8 hours on average for each team member
  • What, if anything, gave you the most trouble: formatting the plots and dealing with changing data pulled from the web

As a true test to yourself, do you understand the code you submitted and are you able to explain it to someone else?

Yes.

7 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.